    The Corpus of Basque Simplified Texts (CBST)

    In this paper we present the Corpus of Basque Simplified Texts (CBST). The corpus compiles 227 original sentences from the science popularisation domain and two simplified versions of each sentence. The simplified versions were created following two different approaches: the structural approach, by a court translator who followed easy-to-read guidelines, and the intuitive approach, by a teacher drawing on her experience. The aim of the corpus is to enable a comparative analysis of simplified texts. To that end, we also present the annotation scheme we created to annotate the corpus. The annotation scheme is divided into eight macro-operations: delete, merge, split, transformation, insert, reordering, no operation and other. These macro-operations can be further classified into different operations. We also relate our work and results to other languages. This corpus will be used to corroborate the decisions taken and to improve the design of the automatic text simplification system for Basque.

    Itziar Gonzalez-Dios's work was funded by a Ph.D. grant from the Basque Government and a postdoctoral grant for new doctors from the Vice-rectory of Research of the University of the Basque Country (UPV/EHU). We are very grateful to the translator and the teacher who simplified the texts. We also want to thank Dominique Brunato, Felice Dell'Orletta and Giulia Venturi for their help with the Italian annotation scheme and their suggestions when analysing the corpus, and Oier Lopez de Lacalle for his help with the statistical analysis. We also want to express our gratitude to the anonymous reviewers for their comments and suggestions. This research was supported by the Basque Government (IT344-10) and the Spanish Ministry of Economy and Competitiveness, EXTRECM Project (TIN2013-46616-C2-1-R).
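
As a rough illustration of the annotation scheme described in the CBST abstract above, the eight macro-operations can be modelled as an enumeration and tallied for one simplified version. This is only a sketch: the names `MacroOperation` and `tally`, and the sample labels, are hypothetical and are not taken from the corpus or its tooling.

```python
from collections import Counter
from enum import Enum


class MacroOperation(Enum):
    """The eight macro-operations named in the CBST annotation scheme."""
    DELETE = "delete"
    MERGE = "merge"
    SPLIT = "split"
    TRANSFORMATION = "transformation"
    INSERT = "insert"
    REORDERING = "reordering"
    NO_OPERATION = "no operation"
    OTHER = "other"


def tally(labels):
    """Count how often each macro-operation occurs in a list of label strings."""
    counts = Counter(MacroOperation(label) for label in labels)
    # Report every macro-operation, including those that never occur.
    return {op: counts.get(op, 0) for op in MacroOperation}


# Invented labels for illustration only; they are not corpus data.
structural_labels = ["split", "delete", "split", "no operation"]
structural_counts = tally(structural_labels)
```

Running the same tally over the labels of the structural and the intuitive versions would give two count tables that can be compared side by side, which is the kind of comparative analysis the abstract describes.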

    Compilation of the academic corpus of novels in Basque HARTAeus and its exploitation for the study of academic phraseology

    An academic corpus of novice writers was compiled for Basque, comparable to the HARTA-noveles corpus for Spanish. A list of academic vocabulary for Basque, together with lists of collocations and formulas, was extracted from the corpus, and the collocations and formulas were assigned discursive functions. The ultimate objective of the HARTAes-vas project, in which this work is framed, is to design an academic writing aid for Basque and Spanish that focuses on academic lexical combinations and integrates lexicographic information and corpora.

    This work is part of the HARTAvas project (PID2019-109683GB-C22), funded by the Spanish Ministry of Science and Innovation.

    Perpaus adberbialen agerpena, maiztasuna eta kokapena EPEC‐DEP corpusean [Occurrence, frequency and position of adverbial clauses in the EPEC-DEP corpus]

    In this report we present the results of analysing the occurrence, frequency and position of adverbial clauses in Basque. The analysis was performed on the EPEC-DEP Basque Dependency Treebank (BDT). We also used the descriptive grammars of Euskaltzaindia, the Royal Academy of the Basque Language.

    A methodology for the semiautomatic annotation of EPEC-RolSem, a Basque corpus labeled at predicate level following the PropBank-VerbNet model

    In this article we describe the methodology developed for the semiautomatic annotation of EPEC-RolSem, a Basque corpus labeled at predicate level following the PropBank-VerbNet model. The methodology is the product of a detailed theoretical study of the semantic nature of verbs in Basque and of their similarities to and differences from verbs in other languages. As part of the proposed methodology, we are creating a Basque lexicon on the PropBank-VerbNet model, which we have named the Basque Verb Index (BVI). Our work thus follows the general trend, evident in work on other languages, toward building lexicons from tagged corpora. EPEC-RolSem and BVI are two important resources for the computational semantic processing of Basque; as far as the authors are aware, they are also the first resources of their kind developed for Basque. In addition, each entry in BVI is linked to the corresponding verb entry in well-known resources such as PropBank, VerbNet, WordNet, Levin's classification and FrameNet. We have also implemented several automatic processes to aid in creating and completing the BVI and to facilitate the manual annotation of the corpus.

    Euskarazko denbora-egiturak etiketatzeko gidalerroak v2.0 [Guidelines for annotating Basque temporal structures v2.0]

    To interpret the temporal information in texts, a mark-up language that encodes that information is needed, so that the information can be accessed automatically. The most widely used mark-up language is TimeML (Pustejovsky et al., 2003), which has also been chosen for Basque. In these guidelines we present the Basque adaptation of ISO-TimeML (ISO-TimeML working group, 2008). After analysing the tags, attributes and values created for English, we describe the ones that best represent the information in Basque temporal structures.
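
To make the idea of a temporal mark-up language concrete, the sketch below wraps one time expression in a TimeML-style `TIMEX3` element. The `TIMEX3` tag and its `tid`, `type` and `value` attributes come from TimeML as defined for English; whether the Basque guidelines keep exactly these attributes and values is not stated in the abstract. The `<s>` wrapper, the helper name `mark_timex` and the sample sentence are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET


def mark_timex(before, expression, after, tid, timex_type, value):
    """Wrap one time expression in a TIMEX3 element inside a sentence element."""
    sentence = ET.Element("s")  # hypothetical sentence wrapper, not part of TimeML
    sentence.text = before
    attributes = {"tid": tid, "type": timex_type, "value": value}
    timex = ET.SubElement(sentence, "TIMEX3", attributes)
    timex.text = expression
    timex.tail = after  # text that follows the time expression
    return ET.tostring(sentence, encoding="unicode")


# Invented example sentence; it does not come from the guidelines.
annotated = mark_timex(
    "The meeting was held on ", "12 May 2014", ".",
    tid="t1", timex_type="DATE", value="2014-05-12",
)
```

The `value` attribute carries a normalised form of the expression, which is what makes the temporal information automatically accessible to later processing steps.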

    Euskarazko denbora-egiturak etiketatzeko gidalerroak v1.0 [Guidelines for annotating Basque temporal structures v1.0]

    To interpret the temporal information in texts, a mark-up language that encodes that information is needed, so that the information can be accessed automatically. The most widely used mark-up language is TimeML (Pustejovsky et al., 2003), which has also been chosen for Basque. In these guidelines we present the Basque adaptation of ISO-TimeML (ISO-TimeML working group, 2008). After analysing the tags, attributes and values created for English, we describe the ones that best represent the information in Basque temporal structures.

    Construcción de un Gold Standard para la Sintaxis Superficial del Euskera [Building a Gold Standard for the Surface Syntax of Basque]

    In this paper, we present the process of building SF-EPEC, a syntactically annotated corpus of 300,000 words that aims to be a Gold Standard for the surface syntactic processing of Basque. First, we describe the tagset designed for this purpose; since Basque is an agglutinative language, compound syntactic tags were sometimes needed. We also describe the different phases in the construction of SF-EPEC.

    Project: PROSA-MED, advanced textual semantic processing for the detection of diagnoses, procedures, other concepts and their relations in medical reports (TIN2016-77820-C3-1-R).

    Kausazko koherentzia-erlazioen azterketa automatikoa euskarazko laburpen zientifikoetan [Automatic analysis of causal coherence relations in Basque scientific abstracts]

    Automatically detecting the cause relations in a text may be useful in question answering and event information extraction tasks. The aim of this paper is to study how to detect coherence relations of the cause subgroup (CAUSE, RESULT and PURPOSE). To achieve this aim, we have used Rhetorical Structure Theory (RST) and automatically obtained linguistic information from different tools developed by the IXA Group. We have used a corpus of 60 scientific abstracts, the Basque RST Treebank (Iruskieta et al., 2013), covering different domains: science, medicine and terminology. A linguist annotated all the signals in the corpus and described the most important problems in this task. To assess the reliability of this annotator, two linguists annotated the signals of the cause subgroup, and all the annotations were compared and evaluated. After that, a superannotator harmonized all the signals of those cause relations. Finally, we show the most important signals for such relations.

    Corpusen etiketatze linguistikoa [Linguistic annotation of corpora]

    In this article, we comment on the steps that have to be taken to label a corpus linguistically and on the difficulties that arise in this process. Our main objective is to highlight the importance of labelling when preparing a corpus that is useful for linguistic research, and the need to establish criteria and take the decisions that this entails. We also explain how semi-automatic methods are applied and how the manual revision that guarantees the quality of the corpus is carried out. Once the corpus has been revised and labelled, it will be useful both for carrying out linguistic analyses and for improving or assessing linguistic tools and resources, and also for channelling automatic study.